Vestnik of Don State Technical University. 2019. Vol. 19, no. 1, pp. 63-73. ISSN 1992-5980 eISSN 1992-6006 





INFORMATION TECHNOLOGY, COMPUTER 
SCIENCE, AND MANAGEMENT 


UDC 004.932.7271 





https://doi.org/10.23947/1992-5980-2019-19-1-63-73 





Deep convolution neural network model in problem of crack segmentation on asphalt images


B. V. Sobol, A. N. Soloviev, P. V. Vasiliev, L. A. Podkolzina


Don State Technical University, Rostov-on-Don, Russian Federation




Introduction. Early elimination of defects (cracks, chips, etc.) on
road sections with high traffic load reduces the risk of emergency
situations. Various photographic and video monitoring techniques are
currently used to control the pavement condition. Manual evaluation
and analysis of the data obtained may take an unacceptably long time.
Thus, it is necessary to improve the procedures for inspecting and
assessing the condition of monitored objects by means of computer
vision.

Materials and Methods. The authors propose a model of a deep
convolutional neural network for identifying defects in road pavement
images. The model is implemented as an optimized version of the
currently most popular fully convolutional neural networks (FCNN).
The construction of the training set and a two-stage network training
process that takes into account the specifics of the problem are
described. The Keras and TensorFlow frameworks were used for the
software implementation of the proposed architecture.

Research Results. The proposed architecture is effective even when
the amount of source data is limited. High precision is observed. The
model can be used in various segmentation tasks. According to the
metrics, the FCNN shows the following defect identification results:
IoU 0.3488, Dice 0.7381.

Discussion and Conclusions. The results can be used in monitoring,
modeling, and forecasting of road pavement wear.


Keywords: artificial neural networks, defect identification, 
segmentation, road pavement, cracks, IoU, Dice. 


Introduction. Road pavement wear requires regular monitoring. Efficient monitoring strategies allow for
the timely detection of problem areas. This approach significantly improves the efficiency of road maintenance, reduces
maintenance costs, and provides continuous operation. Technologies for identifying critical features of the road surface
condition have developed from manual photo recording methods to high-speed digital technology [1].

Russia is among the five countries with the longest road networks. Photo and video monitoring of such a
large-scale infrastructure requires special systems that are reliable, easy to use, and operate online. Clearly,
solutions that involve manual data analysis are not suitable in this case: they are unacceptable because of the
considerable time spent on information processing and the low quality of the analysis.

The authors of the paper propose a new technological solution in the field of machine learning. Its
implementation makes it possible to automate road surface quality assessment. To this end, a convolutional neural
network is trained on manually labeled data. In this way, the system learns to recognize and evaluate the basic types
of damage of the monitored objects.

Research Digest. Many papers describe improved algorithms for detecting defects in structural components
and infrastructure facilities. Computer vision is widely used to solve this problem, and its continuous improvement is
supported by the development of sensing technologies, hardware, and software. However, it should be recognized that at
present computer vision is applied only to a limited extent in practice. This is due to many factors, including:

- heterogeneity of defects, 

- variety of types of surfaces, 

- complexity of the background, 

- junctions. 

The authors of a number of publications investigate automated methods for detecting cracks in images and
propose their own solutions [2-9]. Some works consider the specifics of monitoring road transportation infrastructure
objects [10, 11], as well as bridges and structural systems [12, 13].

Until recently, these problems were mostly solved with hand-crafted image processing techniques, such as:

- morphological operations [13], 

- analysis of geometric features [6], 

- application of Gabor filters [14], 

- wavelet transforms [15], 

- building of histograms of oriented gradients (HOG) [16], 

- texture analysis, 

- machine learning [4]. 

However, the listed tools are now used less and less often. They are being replaced by neural network
technologies and machine learning, whose wide spread is supported by the computational capabilities of graphics
processors.

A convolutional neural network (CNN) is a multi-layered artificial neural network architecture designed
specifically for image processing [17]. Its convolution layers implement local receptive fields, and its subsampling
(pooling) layers provide invariance to small geometric distortions.
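The pooling property mentioned above can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the helper name `max_pool_2x2` and the toy 4x4 inputs are assumptions chosen only to show that a one-pixel shift inside a pooling window leaves the pooled output unchanged.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks and keep the maximum of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A single bright pixel, and the same pixel shifted by one position
# inside the same 2x2 pooling window.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[1, 1] = 1.0

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
print(pooled_a.shape, np.array_equal(pooled_a, pooled_b))
```

The pooled maps have half the side length of the input and are identical for both inputs; a shift that crosses a window boundary would, of course, change the result.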

This architecture demonstrates outstanding results in solving the following recognition problems: 

- handwritten digits [18],

- house numbers based on the Google Street View house number (SVHN) dataset [19], 

- road signs [20]. 

The growth in the computational power of graphics processors allows deeper architectures of machine learning
models to be used [21]. It is now possible to avoid overfitting [22]. This is facilitated by the development of such
modern techniques as data augmentation, regularization, etc.
Improvements in convolutional neural networks make it possible to learn and generalize image features more
efficiently (for example, in image classification [23], object search [24], and vehicle detection [25]).

The flexibility and prospects of deep learning for the automatic detection of pavement cracks are shown
in [26, 27].

In [28], the application of neural networks to the automatic detection and classification of cracks in asphalt
is considered. The authors propose to use the overall mean and variance of the grayscale values. Based on these
indicators, the image is divided into fragments, and each cell is then classified as containing a crack or not. The
motivation for using falling weight deflectometers (FWD) to assess asphalt cracks is shown. In 98% of cases, the system
successfully detects a crack in the image.

In [29], the use of a neural network for defect detection was investigated. The advantages of clustering
pixels into objects were demonstrated: this increases identification accuracy and reduces noise.

In [30], the authors used a deep learning architecture that includes the VGG-16 model, pre-trained to identify
features that distinguish between classes of images. The model demonstrated excellent recognition quality even when
working with images from domains unknown to it. The VGG-16 CNN was used as a deep feature generator for pavement
images. The authors trained only the last classifier layer. They experimented with different machine learning models
and showed their strengths and weaknesses.

In [31], the use of a CNN in an applied robotics problem is shown: the autonomous detection and assessment of
cracks and damage in sewer pipes. The CNN filters the data and localizes the cracks, which makes it possible to
characterize their geometry.

The objective of [32] is to automate the sequential detection of chips and to quantify damage in metro
networks. For that, an integrated model that implements a hybrid algorithm and an interactive 3D representation is
created. Chipping depth prediction is supported by regression analysis.

The paper [33] provides an overview and assessment of promising approaches that automatically detect cracks 
and corrosion in the civil infrastructure systems. 

In [34], an effective CNN-based architecture for detecting pavement cracks on a three-dimensional asphalt
surface is described. The CrackNet architecture provides high precision of data processing through an original method
of representing the road surface geometry. CrackNet consists of five layers and includes more than a million trainable
parameters. Experiments on 200 test 3D images have shown that the CrackNet accuracy can reach 90.13%.

Proposed Method. To identify defects in pavement images, it is necessary to determine what is a defect and
what is not. In other words, image segmentation should be carried out and the appropriate classes identified. Recently,
this type of problem has been effectively solved with purpose-built convolutional neural network architectures, such as
SegNet [35] and U-Net [36].

The specifics of pavement images involve a small range of gray shades and a minor difference between the 
background and the target object. Also, the task is complicated by noise, defects and the occurrence of foreign objects 
in the image. 

Various data sets are used to train neural networks [7, 37]. These sets include original pavement images, with
or without defects, and their corresponding mask images. Images of road surface defects are specific, so the authors
propose their own simplified model of a deep convolutional neural network. For image segmentation, a fully
convolutional neural network (FCNN) [38] with an encoder-decoder structure is proposed. The image of the road surface
is fed to the system input, and the output is a binary image: a segmented image showing the presence or absence of
defects.

Architecture of Deep Convolution Neural Network. Fig. 1 shows the architecture of the proposed deep
convolutional neural network.
Fig. 1. Architecture of proposed network


The neural network consists of two parts: convolution and deconvolution. The convolution part converts the
input image into a multi-dimensional feature representation; in other words, it performs feature extraction. The
deconvolution part serves as a generator that creates a segmented image from the features obtained by the convolution
part. The last convolution layer of the network, with a sigmoid activation function, generates the segmented image: a
defect probability map of the same size as the input picture.

The first part of the network consists of five convolution layers with filter sets (256, 128, 64, 64, 64).
Batch normalization (BN) [39] is used, and the rectified linear unit (ReLU) serves as the activation function.
Subsampling (pooling) layers with a 2x2 window follow; passing through such a layer halves the image size. The second
part of the network is a mirror image of the first. It must restore the image to its original size and form a
probability map based on the input image features. To this end, upsampling layers are used in combination with
convolution layers. The proposed neural network has 10 convolution layers and 929 665 trainable parameters.
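The mirrored layout described above can be traced with a short sketch. It is an illustration only: the helper `trace_sizes` and the number of pooling stages (`n_pool=2`) are assumptions, since the text gives the filter sets and the 2x2 window but does not state how many pooling layers are used. The sketch does not attempt to reproduce the reported parameter count.

```python
# Filter sets of the first (convolution) part, as listed in the text.
ENCODER_FILTERS = (256, 128, 64, 64, 64)
# The second (deconvolution) part mirrors the first.
DECODER_FILTERS = ENCODER_FILTERS[::-1]

def trace_sizes(input_size=64, n_pool=2):
    """Spatial size after each 2x2 pooling, then after each 2x upsampling.

    n_pool is an assumed value; the text does not state it.
    """
    sizes = [input_size]
    for _ in range(n_pool):      # each pooling layer halves the size
        sizes.append(sizes[-1] // 2)
    for _ in range(n_pool):      # each upsampling layer doubles it back
        sizes.append(sizes[-1] * 2)
    return sizes

n_conv_layers = len(ENCODER_FILTERS) + len(DECODER_FILTERS)
sizes = trace_sizes(64, 2)
print(n_conv_layers, DECODER_FILTERS, sizes)
```

With five encoder and five mirrored decoder layers the sketch arrives at the 10 convolution layers stated in the text, and the upsampling stages return the fragment to its input size.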

Dataset Generation. The CrackForest data set [7] is used for training the constructed model. It is augmented
(artificially enlarged), since the training and operation of the neural network follow a patch-based approach, which
uses randomly cropped fragments of the original images.

The data set consists of 117 images. It is divided into training, test, and validation samples. Fragments of
64x64 pixels are randomly selected from each image of the training and test samples. The studies have shown that gamma
correction of the images improves the neural network quality for this task. Each image fragment undergoes rotation,
reflection, and deformation. The optimal ratio of fragments with and without defects was established as 95% to 5%.
Only defects that occupy at least 5% of the image area are taken into account. The sample size affects the training
process and the network quality. The optimal sizes were established as follows: 15 200 fragments in the training sample
and 3 968 in the test sample. Fig. 2 shows the images and the corresponding masks used for training the neural network.







Fig. 2. Images and binary masks obtained from data augmentation 
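The patch-based augmentation described in this section might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the image is synthetic, the gamma value 0.8 is arbitrary, rotations are limited to multiples of 90 degrees, and the deformation step is omitted.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma correction for an image with values in [0, 1]."""
    return np.power(img, gamma)

def random_patch(image, mask, size, rng):
    """Crop the same randomly placed size x size window from image and mask."""
    h, w = image.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return (image[top:top + size, left:left + size],
            mask[top:top + size, left:left + size])

def augment(patch, mask, rng):
    """Random 90-degree rotation and horizontal flip, applied identically to both."""
    k = int(rng.integers(0, 4))
    patch, mask = np.rot90(patch, k), np.rot90(mask, k)
    if rng.random() < 0.5:
        patch, mask = np.fliplr(patch), np.fliplr(mask)
    return patch, mask

def defect_fraction(mask):
    """Share of the fragment area covered by defect pixels."""
    return float(mask.mean())

rng = np.random.default_rng(0)
image = rng.random((320, 480))                 # synthetic stand-in for a pavement image
mask = np.zeros((320, 480), dtype=np.uint8)
mask[100:110, :] = 1                           # synthetic horizontal "crack"

patch, patch_mask = random_patch(gamma_correct(image, 0.8), mask, 64, rng)
patch, patch_mask = augment(patch, patch_mask, rng)
print(patch.shape, defect_fraction(patch_mask))
```

A helper such as `defect_fraction` could be used to keep only fragments above the 5% area threshold and to balance fragments with and without defects in the stated ratio.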
Neural Network Training. The overlap measure between two detections (intersection over union, IoU, or Jaccard
index) and a closely related binary similarity measure (Sørensen-Dice coefficient) are used to train and evaluate the
neural network. The function 1 - J is used as the cost function, where, for pixel sets A and B,

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

The initialization of weights in the layers of the neural network is carried out through the Glorot method [40]. 
To reduce the internal covariate shift, batch normalization is applied, which normalizes the input distribution of each
layer. The Adam algorithm (a stochastic optimization method) is used for training [41].

At the first stage, the neural network is trained on a small amount of data (30% of the main set) for 5
epochs. At the second stage, the network is trained on the full amount of data for the required number of epochs.


The learning rate varies with each epoch according to the established dependence (Fig. 3). 
Fig. 3. Variation of network learning rate depending on epoch


For this task, the optimal number of training epochs was found to be 25 (5 epochs at the first stage of
training and 20 at the second). With more epochs, the neural network accuracy did not change significantly.
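The two-stage schedule can be written down as a simple epoch plan. The sketch below is an assumption-laden illustration: the function name `two_stage_plan`, the random choice of the 30% subset, and the seed are hypothetical, and the learning-rate dependence of Fig. 3 is not reproduced because its values are not given in the text.

```python
import random

def two_stage_plan(n_samples, warmup_fraction=0.3, warmup_epochs=5,
                   main_epochs=20, seed=0):
    """Return a list of (epoch, sample_indices) pairs for both training stages."""
    rng = random.Random(seed)
    all_idx = list(range(n_samples))
    # Stage 1: a fixed random subset of the training fragments (assumed sampling).
    subset = rng.sample(all_idx, round(n_samples * warmup_fraction))
    plan = [(epoch, subset) for epoch in range(warmup_epochs)]
    # Stage 2: the full training set.
    plan += [(warmup_epochs + epoch, all_idx) for epoch in range(main_epochs)]
    return plan

plan = two_stage_plan(15200)
print(len(plan), len(plan[0][1]), len(plan[5][1]))
```

With the 15 200 training fragments mentioned earlier, the first five epochs would each see 4 560 fragments and the remaining twenty epochs the full set.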

Keras and TensorFlow frameworks were used for the implementation of the developed architecture of deep 
CNN. 

Research Results. After training, the neural network is validated on test data. Each image fragment is fed to
the network input, and the output is a generated map of defect probabilities. Fig. 4 shows the output of the trained
network compared with the true values from the test sample.
Fig. 4. Results of trained neural network performance


In Fig. 4, the data are arranged in four columns:

1) image under study, 

2) neural network output, 

3) manually identified defect, 

4) difference between 2) and 3). 

The network outputs are compared with the true values. The values of the IoU and Dice metrics depend on the
relationship between:

- the area of the defect and the area of the whole image,

- the binary (one-bit) mask and the real-valued (4-byte) generated image.

It should be noted that IoU takes the value 0 for defect-free fragments (Fig. 5).
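For a single pair of binary masks, the two metrics are tied by a simple identity, which may help when reading the averaged values. The derivation below uses only the standard definitions of the Jaccard index J and the Dice coefficient D; the symbols D, A and B are introduced here for illustration.

```latex
\begin{aligned}
D(A, B) &= \frac{2\,|A \cap B|}{|A| + |B|}, \qquad |A| + |B| = |A \cup B| + |A \cap B|, \\
D(A, B) &= \frac{2\,|A \cap B|}{|A \cup B| + |A \cap B|} = \frac{2\,J(A, B)}{1 + J(A, B)}.
\end{aligned}
```

The identity holds per fragment with a nonempty union. Values averaged over many fragments, some of which are defect-free, need not satisfy it.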


[Histogram: network prediction accuracy. Average accuracy: Dice 73.81%, IoU 34.88%]

Fig. 5. Number of images identified with specified degree of accuracy


The quality of the prepared data set strongly affects the training and the neural network output. In some
cases, the neural network indicates a defect although the ground-truth image does not contain one, or vice versa. This
affects the overall assessment of the model quality. In general, evaluating the accuracy of the neural network with the
proposed metrics can be subjective, so the data in Fig. 4 should not be taken as absolute.

Several FCN models were evaluated in this work. The results are shown in Table 1.




















Table 1
Accuracy of some neural networks

Network architecture                                               Dice, %    IoU, %
10 layers (256, 128, 64, 64, 64, ...), 929 665 parameters          73.81      34.88
16 layers (32, 32, 16, 16, 16, 8, 8, 8, ...), 43 441 parameters    70.40      33.24
12 layers (32, 32, 16, 16, 8, 8, ...), 37 537 parameters           67.57      32.12








Here, the number of filters in the first part of the network is indicated in brackets. In the second part of
the network, the filter counts are mirrored (see Fig. 1).

To process high-resolution images, a sliding window method is used, with a specified step that controls the
trade-off between processing speed and level of detail. In this way, the resulting defect probability map for the whole
image is built. Several images from the validation set and the results of their processing by the neural network are
shown in Fig. 6.
Fig. 6. Validation images processed by trained FCN
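The sliding window procedure could look roughly like the sketch below. Several details are assumptions rather than the authors' implementation: overlapping predictions are averaged, the last window in each direction is aligned to the image border, the step does not exceed the window size, and `dummy_model` is a hypothetical stand-in for the trained network.

```python
import numpy as np

def sliding_window_predict(image, predict_patch, window=64, step=32):
    """Average overlapping patch predictions into a full-size probability map.

    Assumes step <= window and that the image is at least window x window.
    """
    h, w = image.shape
    # Window origins; the border-aligned origin guarantees full coverage.
    tops = sorted(set(list(range(0, h - window + 1, step)) + [h - window]))
    lefts = sorted(set(list(range(0, w - window + 1, step)) + [w - window]))
    prob_sum = np.zeros((h, w), dtype=np.float64)
    counts = np.zeros((h, w), dtype=np.float64)
    for top in tops:
        for left in lefts:
            patch = image[top:top + window, left:left + window]
            prob_sum[top:top + window, left:left + window] += predict_patch(patch)
            counts[top:top + window, left:left + window] += 1.0
    return prob_sum / counts

def dummy_model(patch):
    """Hypothetical stand-in for the network: darker pixels get higher probability."""
    return 1.0 - patch

image = np.random.default_rng(0).random((320, 480))
prob_map = sliding_window_predict(image, dummy_model)
print(prob_map.shape)
```

A smaller step produces more overlapping windows and a smoother map at the cost of more network evaluations, which is the speed and detail trade-off mentioned in the text.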


Discussion and Conclusions. A model of a deep convolutional neural network is proposed for identifying defects
in pavement images. The model is implemented as a simplified and optimized version of the currently most popular FCN
networks. The techniques for constructing a training set and a two-stage network training process that takes into
account the specifics of the problem are described. The work shows that such architectures can be used successfully
with a small amount of input data. High precision is observed. The described model can be used in various segmentation
tasks. According to the metrics, the FCNN shows the following results: IoU 0.3488, Dice 0.7381.